Skip to content

chore: Use pre-generated C data files, unify comparison scripts#257

Merged
alamb merged 4 commits into
clflushopt:cl/feat/c-dsdgen-compatfrom
alamb:alamb/nicer_tpcds_cli
May 12, 2026
Merged

chore: Use pre-generated C data files, unify comparison scripts#257
alamb merged 4 commits into
clflushopt:cl/feat/c-dsdgen-compatfrom
alamb:alamb/nicer_tpcds_cli

Conversation

@alamb
Copy link
Copy Markdown
Collaborator

@alamb alamb commented May 11, 2026

Note: This is still very much WIP, not ready for review

This targets

I totally got nerd-sniped today trying to verify the C compat mode

Instead of downloading (slightly suspicious) parquet files, I think it would be better to use the raw, unencumbered output, from dsgen (the original C program)

alamb and others added 3 commits May 12, 2026 05:34
Move per-script docs (usage, flags, env vars, output, exit codes) into the
top-of-file header comment of each script. The README's `## Scripts`
section becomes a one-line-per-script roadmap pointing readers at the
script files for details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e in print_usage()

- Add `--compat trino|c` to generate-fixtures.sh:
  - `--compat trino` (default): existing Java generation behavior.
  - `--compat c`: download alamb/tpcds-data sfN with `git clone --depth 1`
    and extract into tests/fixtures/scale-N-c/. Supports `--rebuild` and
    `--verify`. Replaces bootstrap-c.sh, which is removed.
- Per-script header is now a one-liner + "see print_usage() below for
  details"; print_usage() sits immediately after the header with the
  full usage block (flags, env vars, examples, exit codes). Renamed
  `usage` -> `print_usage` everywhere.
- Update tpcdsgen/README.md, scripts/README.md, and the CI workflow
  (`tpcdsgen-conformance.yml`) to call generate-fixtures.sh --compat c
  instead of bootstrap-c.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `paths` filter to both the push and pull_request triggers so the
suite no longer fires for unrelated changes (e.g. tpchgen-* edits, doc
tweaks at the repo root).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alamb alamb marked this pull request as ready for review May 12, 2026 10:41
@alamb
Copy link
Copy Markdown
Collaborator Author

alamb commented May 12, 2026

@clflushopt I am taking the liberty of merging this consolidation directly into your branch so we can get it ready to go

@alamb alamb merged commit cff27df into clflushopt:cl/feat/c-dsdgen-compat May 12, 2026
22 checks passed
alamb added a commit that referenced this pull request May 14, 2026
* feat: dsdgen compatibility

* chore: Use pre-generated C data files, unify comparison scripts (#257)

* chore: Use pre-generated C data files, unify comparison scripts

* chore: make scripts self-documenting; collapse scripts/README

Move per-script docs (usage, flags, env vars, output, exit codes) into the
top-of-file header comment of each script. The README's `## Scripts`
section becomes a one-line-per-script roadmap pointing readers at the
script files for details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: fold bootstrap-c.sh into generate-fixtures.sh; centralize usage in print_usage()

- Add `--compat trino|c` to generate-fixtures.sh:
  - `--compat trino` (default): existing Java generation behavior.
  - `--compat c`: download alamb/tpcds-data sfN with `git clone --depth 1`
    and extract into tests/fixtures/scale-N-c/. Supports `--rebuild` and
    `--verify`. Replaces bootstrap-c.sh, which is removed.
- Per-script header is now a one-liner + "see print_usage() below for
  details"; print_usage() sits immediately after the header with the
  full usage block (flags, env vars, examples, exit codes). Renamed
  `usage` -> `print_usage` everywhere.
- Update tpcdsgen/README.md, scripts/README.md, and the CI workflow
  (`tpcdsgen-conformance.yml`) to call generate-fixtures.sh --compat c
  instead of bootstrap-c.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: only run tpcdsgen conformance on tpcdsgen/ or .github/ changes

Adds a `paths` filter to both the push and pull_request triggers so the
suite no longer fires for unrelated changes (e.g. tpchgen-* edits, doc
tweaks at the repo root).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: rename return_reasons.dst to return_reasons_trino.dst

Pair the inherited Java/Trino distribution with the corrected C variant
so both filenames advertise their compat mode at a glance:

    return_reasons_trino.dst   <-- old return_reasons.dst (carries the
                                   "reason 30 missing, reason 31
                                   duplicated" bug, kept for byte-for-byte
                                   Trino fixture stability)
    return_reasons_c.dst       <-- corrected, used by --compat c

Updates the embedded_data.rs lookup table, the distribution loader,
and the doc-comment references in scaling.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: consolidate MD5SUMS into scale-N-{java,c}/ fixture dirs

Move the canonical Java reference hashes from
  tests/fixtures/java/scale-{1,10}/MD5SUMS
to
  tests/fixtures/scale-{1,10}-java/MD5SUMS

and generate a fresh
  tests/fixtures/scale-1-c/MD5SUMS

from the current alamb/tpcds-data sf1 download (post-regeneration).

The old tests/fixtures/rust/scale-{1,10}/MD5SUMS files are removed:
they were byte-identical to the Java set apart from dbgen_version, which
contains a generation timestamp and is always excluded from comparison.

The empty tests/fixtures/java/ parent directory is gone too.

README references already use the new scale-N-java/ paths (from
the earlier rename); no further doc updates were needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: use 'trino' (not 'java') consistently for the Trino TPC-DS port

'java' is ambiguous — there may be multiple Java TPC-DS implementations.
The reference we target is specifically the Trino library, so name
everything after it for clarity:

  - tests/fixtures/scale-N-java/   -> tests/fixtures/scale-N-trino/
  - scripts/bootstrap-java.sh       -> scripts/bootstrap-trino.sh
  - TPCDS_JAVA_REPO  env var        -> TPCDS_TRINO_REPO
  - JAVA_DIR / JAVA_REPO_URL vars   -> TRINO_DIR / TRINO_REPO_URL
  - find_java_jar / clone_java_repo / build_java / test_java
        -> find_trino_jar / clone_trino_repo / build_trino / test_trino
  - CI artifact `test-fixtures-java` -> `test-fixtures-trino`
  - "Java fixture" log labels        -> "Trino fixture"
  - Doc references throughout READMEs and script headers updated.

Kept as-is: `actions/setup-java@v5`, `Java 11+` requirement, `java -jar`
/ `java -version` invocations, and `mvn`/`openjdk` references — those
refer to the Java language/runtime, not the Trino implementation.

The CLI flag and Rust `CompatMode::Trino` were already named `trino`;
this commit aligns the rest of the codebase.

Verified: `./scripts/test-all-tables.sh` passes 24/24 vs Trino, and
`./scripts/test-all-tables.sh --compat c` passes 23/23 vs C dsdgen
(customer.dat still skipped pending alamb/tpcds-data regeneration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: add references to documented bugs

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant